
Fix autoscaler returning fatal error on GOAWAY if no scaling seen #973


Merged
merged 3 commits into master from fix-fatal-autoscale-poll-err on Aug 15, 2025

Conversation

Sushisource
Member

What was changed

Fix autoscaling pollers possibly treating a GOAWAY/connection-closed error as fatal when they shouldn't.

Why?

We don't want workers to die in this situation.

Checklist

  1. Closes [Bug] Temporal Worker RuntimeError: Poll failure: Unhandled grpc error when polling #964

  2. How was this tested:
    Added unit tests, and verified with a long-running integration test. Unfortunately I could not find any reasonable way to simulate this exact GOAWAY error with the fake in-memory gRPC server I have, even after a lot of research.

  3. Any docs updates needed?

@Sushisource Sushisource requested a review from a team as a code owner August 9, 2025 00:51
}

async fn fake_server<F>(response_maker: F) -> FakeServer
where
-    F: FnMut() -> Response<Body> + Clone + Send + Sync + 'static,
+    F: FnMut() -> BoxFuture<'static, Response<Body>> + Clone + Send + Sync + 'static,
Member Author


I made this change while trying to get the fake server to produce the GOAWAY. I couldn't do it, but it's probably useful for the fake responses to be async anyway.
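
For illustration only (this is not code from the PR): with the new `FnMut() -> BoxFuture<'static, Response<Body>>` bound, the response maker can await inside the closure. The sketch below assumes tokio, futures-util, and the http crate, substitutes a plain `String` body for the real `Body` type, and uses a hypothetical `takes_response_maker` as a stand-in for the fake server's constructor.

```rust
use std::time::Duration;

use futures_util::future::{BoxFuture, FutureExt};
use http::Response;

// Hypothetical stand-in for the fake server's constructor: it only checks that
// the closure satisfies the new bound shown in the diff above.
fn takes_response_maker<F>(_response_maker: F)
where
    F: FnMut() -> BoxFuture<'static, Response<String>> + Clone + Send + Sync + 'static,
{
}

#[tokio::main]
async fn main() {
    // The closure boxes an async block, so a test can await inside it (sleep,
    // coordinate with another task, etc.) before producing the fake response.
    let mut maker = move || -> BoxFuture<'static, Response<String>> {
        async {
            tokio::time::sleep(Duration::from_millis(10)).await;
            Response::new("fake grpc response".to_string())
        }
        .boxed()
    };

    // Calling the maker yields a future; awaiting that yields the response.
    let resp = maker().await;
    assert_eq!(resp.body(), "fake grpc response");

    takes_response_maker(maker);
}
```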

@Sushisource Sushisource force-pushed the fix-fatal-autoscale-poll-err branch from 6efa35e to 68a9191 Compare August 9, 2025 00:58
Comment on lines +555 to +561
// Only propagate errors out if they weren't because of the short-circuiting
// logic. IE: We don't want to fail callers because we said we wanted to know
// about ResourceExhausted errors, but we haven't seen a scaling decision yet,
// so we're not reacting to errors, only propagating them.
return !e
    .metadata()
    .contains_key(ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT);
Member Author


This is the fix

Member


Can I get some background here? I expect Tonic to hide GOAWAY and implicitly handle reconnects. We send a GOAWAY every 5m on average. Is there a situation where regular non-worker client usage may see a GOAWAY?

Member Author


There's a bug in hyper or tonic that was worked around here: #811

But the short-circuit that the autoscaler turns on (so it can scale better on otherwise non-fatal errors) makes this error surface as fatal, so it gets ignored specifically here.
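
For anyone following along, a rough sketch of that mechanism (illustrative, not the actual sdk-core code): the short-circuit path tags the `tonic::Status` it passes through with a metadata key, and the fatal-error predicate (the highlighted snippet above) skips statuses carrying that tag. The key string and the helper function names below are assumptions.

```rust
use tonic::{Code, Status};

// Hypothetical metadata key; the real constant lives in sdk-core's poller
// code and its exact string value is not shown in this PR excerpt.
const ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT: &str = "short-circuited-error";

// When the autoscaler's short-circuit passes an otherwise-retryable error out
// to the caller (because no scaling decision has been seen yet), it tags the
// status so later checks know the error was only propagated, not fatal.
fn tag_short_circuited(mut status: Status) -> Status {
    status
        .metadata_mut()
        .insert(ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT, "true".parse().unwrap());
    status
}

// The fatal-error predicate (as in the snippet above) then ignores tagged
// statuses, so a GOAWAY/"connection closed" surfaced this way doesn't kill
// the worker.
fn is_fatal(status: &Status) -> bool {
    !status
        .metadata()
        .contains_key(ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT)
}

fn main() {
    let goaway = Status::new(Code::Cancelled, "connection closed");
    assert!(is_fatal(&goaway));
    assert!(!is_fatal(&tag_short_circuited(goaway)));
}
```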

Member

@cretz, Aug 14, 2025


(marking approved anyways, just trying to understand some strangeness here)

So IIUC the server by default sends a GOAWAY after 5m (i.e. a soft connection close telling you to stop making new calls once whatever is in flight completes) and then hard-closes the TCP connection after 7m (because 2m between the soft and hard close should be enough for a properly behaving client to never hit this).

So somehow we're sending RPC calls even after the soft close and therefore hitting the 7m limit? If that's the case, there can be rare/racy data loss if the hard TCP close occurs during a gRPC call (in our case it'd be a task timeout, because the server might send us a task and then close the connection). Or maybe Tonic is the one eagerly returning Code::Cancelled with "connection closed" before it ever even makes the call? Obviously not important if it all works today, but it is a bit confusing to me.

Member Author


I think the latter is the explanation, but yeah, I agree it's confusing.

@Sushisource Sushisource force-pushed the fix-fatal-autoscale-poll-err branch from 68a9191 to a96360b Compare August 14, 2025 18:22
@Sushisource Sushisource merged commit 148774c into master Aug 15, 2025
16 of 18 checks passed
@Sushisource Sushisource deleted the fix-fatal-autoscale-poll-err branch August 15, 2025 21:15